Abstract. The paper is devoted to the problem of mapping affine loop
nests onto distributed memory parallel computers. A method for finding
affine transformations of loop nests for parallel execution and for distributing
data over processors is presented. The method aims to minimize the
number of communications between processors and to improve the locality of
data within one processor. The problem of determining the data exchange
sequence is investigated. Conditions for determining whether a broadcast
can be arranged are presented.



1 Introduction 

To map algorithms given by sequential programs onto distributed memory parallel
computers is to distribute data and computations over processors and to determine
an execution sequence of operations and a data exchange sequence. The most
important problems are scheduling [...], space-time mapping [...], alignment
[1, 3, 4], and determination of the data exchange sequence [...]. An essential
stage in solving these problems is to find functions (scheduling functions,
statement and array allocation functions) satisfying certain constraints.

One of the preferable parallelization schemes is based on obtaining multi- 
dimensional scheduling functions. Some coordinates of the multi-dimensional 
scheduling functions are used for operations allocation. The other coordinates 
are used for scheduling operations. 

For the program execution time to be as small as possible, it is necessary
to solve an alignment problem: operations and data have to be allocated in a
coordinated way so as to minimize communications.

The program execution time depends not only on the execution time of operations
but also on the memory access time. The access time depends on the location of
data in the hierarchical memory. Therefore the problem of prompt data reuse
within one processor (the localization problem) is of great importance [...].
Two kinds of localization are distinguished: localization in time and
localization in space. Time localization orders the execution of operations so
that data are reused before they are moved to a lower level of memory. Spatial
localization makes use of data that are allocated close to each other in
memory. Locality depends on the execution sequence of operations; hence it is
desirable to take it into account when scheduling functions are obtained.

In this paper, a method for solving all of the mentioned problems simultaneously
is suggested. It results in high performance of parallel code execution.

After an algorithm has been transformed for parallel execution, it is necessary
to determine the data exchange sequence. In many cases the use of broadcast,
gather, scatter, reduction, and translation improves the efficiency of
a parallel program. In this paper, we investigate the problem of determining
the data exchange sequence and give conditions under which broadcast
communications may be used.



2 Main Definitions 



Let an algorithm be represented by an affine loop nest. For such algorithms,
array indices and bounds of loops are affine functions of outer loop indices or
loop-invariant variables. Let a loop nest contain $K$ statements $S_\beta$ and
use $L$ arrays $a_l$. By $V_\beta$ denote the iteration domain of statement
$S_\beta$, by $W_l$ denote the index domain of array $a_l$. By $n_\beta$ denote
the number of loops surrounding statement $S_\beta$. By $\nu_l$ denote the
dimension of array $a_l$. Then $V_\beta \subset \mathbb{Z}^{n_\beta}$,
$W_l \subset \mathbb{Z}^{\nu_l}$. By $J \in \mathbb{Z}^{n_\beta}$ denote the
iteration vector, by $N \in \mathbb{Z}^{e}$ denote the vector of outer
variables, where $e$ is the number of these variables.

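Let $\overline{F}_{l,\beta,q}: V_\beta \to W_l$ denote the access function that
puts the iteration domain $V_\beta$ into correspondence with the index domain
$W_l$ for the $q$-th input of elements of array $a_l$ into instruction
$S_\beta$. Suppose $\overline{F}_{l,\beta,q}$ are affine functions:
$\overline{F}_{l,\beta,q}(J) = F_{l,\beta,q} J + G_{l,\beta,q} N + f^{(l,\beta,q)}$,
where $J \in V_\beta$, $F_{l,\beta,q} \in \mathbb{Z}^{\nu_l \times n_\beta}$,
$N \in \mathbb{Z}^{e}$, $G_{l,\beta,q} \in \mathbb{Z}^{\nu_l \times e}$,
$f^{(l,\beta,q)} \in \mathbb{Z}^{\nu_l}$.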
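
As an illustration of the affine access function, the following minimal sketch evaluates $F J + G N + f$ for one array reference. The reference `a[i + j][j]`, the matrices and the helper name `affine_access` are assumptions chosen for the example; they do not come from the paper.

```python
# Hypothetical example: the reference a[i + j][j] inside a 2-deep nest over (i, j),
# with a single outer variable N that does not appear in the index expressions.
def affine_access(F, G, f, J, N):
    """Return F*J + G*N + f as a list of integers (plain-Python matrix-vector form)."""
    return [
        sum(F[r][c] * J[c] for c in range(len(J)))
        + sum(G[r][c] * N[c] for c in range(len(N)))
        + f[r]
        for r in range(len(F))
    ]

F = [[1, 1], [0, 1]]   # coefficients of the loop indices (i, j)
G = [[0], [0]]         # coefficients of the outer variable N
f = [0, 0]             # constant term

index = affine_access(F, G, f, J=[2, 3], N=[10])
print(index)  # [5, 3]
```

Iteration $(i, j) = (2, 3)$ therefore touches element `a[5][3]`.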

Given a statement $S_\beta$, a computation instance of $S_\beta$ is called an
operation and is denoted by $S_\beta(J)$. Denote a dependence of operation
$S_\beta(J)$ on operation $S_\alpha(I)$ by $S_\alpha(I) \to S_\beta(J)$. We
consider flow-, anti-, out-, and in-dependences. Denote by $P$ the set of pairs
of indices $(\alpha, \beta)$ such that $S_\alpha(I) \to S_\beta(J)$.

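Let $\overline{\Phi}_{\alpha,\beta}: V_{\alpha,\beta} \to V_\alpha$ be the
dependence function. If $S_\alpha(I) \to S_\beta(J)$, $I \in V_\alpha$,
$J \in V_{\alpha,\beta} \subset V_\beta$, then
$I = \overline{\Phi}_{\alpha,\beta}(J)$. Suppose $\overline{\Phi}_{\alpha,\beta}$
are affine functions:
$\overline{\Phi}_{\alpha,\beta}(J) = \Phi_{\alpha,\beta} J + \Psi_{\alpha,\beta} N - \varphi^{(\alpha,\beta)}$,
where $J \in V_{\alpha,\beta}$,
$\Phi_{\alpha,\beta} \in \mathbb{Z}^{n_\alpha \times n_\beta}$,
$\Psi_{\alpha,\beta} \in \mathbb{Z}^{n_\alpha \times e}$,
$\varphi^{(\alpha,\beta)} \in \mathbb{Z}^{n_\alpha}$.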

3 Multi-Dimensional Scheduling Functions. Data 
Allocation Functions 

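Let $n = \max_\beta n_\beta$. Let functions
$t^{(\beta)}: V_\beta \to \mathbb{Z}^n$ assign a vector
$(t^{(\beta)}_1(J), \ldots, t^{(\beta)}_n(J))$ to each operation $S_\beta(J)$.
Suppose $t^{(\beta)}_\xi$ are affine functions:
$t^{(\beta)}_\xi(J) = \tau^{(\beta,\xi)} J + b^{(\beta,\xi)} N + a_{\beta,\xi}$,
$J \in V_\beta$, $\tau^{(\beta,\xi)} \in \mathbb{Z}^{n_\beta}$,
$b^{(\beta,\xi)} \in \mathbb{Z}^{e}$, $a_{\beta,\xi} \in \mathbb{Z}$,
$1 \le \beta \le K$, $1 \le \xi \le n$.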

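Functions $t^{(\beta)}$ are called vector scheduling functions if

$\operatorname{rank} T^{(\beta)} = n_\beta, \quad 1 \le \beta \le K,$   (1)

$t^{(\beta)}(J) \ge_{lex} t^{(\alpha)}(I), \quad J \in V_\beta,\ I \in V_\alpha, \ \text{if } S_\alpha(I) \to S_\beta(J).$   (2)

Here $T^{(\beta)}$ is a matrix whose rows are the vectors
$\tau^{(\beta,1)}, \ldots, \tau^{(\beta,n)}$; $S_\alpha(I) \to S_\beta(J)$ is
any dependence except an in-dependence; the notation $\ge_{lex}$ denotes
"lexicographically greater than or equal to".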

A set of vector functions $t^{(\beta)}$, $1 \le \beta \le K$, is called a
multi-dimensional scheduling. We can use these functions to transform loops,
assuming the operation $S_\beta(J)$ to be executed at the iteration
$t^{(\beta)}(J)$. Thus we interpret the elements of the vector $t^{(\beta)}$ as
indices of the transformed loop nest for the statement $S_\beta$:
$t^{(\beta)}_1$ is the index of the outermost loop, $t^{(\beta)}_n$ is the index
of the innermost loop. Note that the functions $t^{(\beta)}$ determine a
permissible transformation of the loop nest, i.e., this transformation keeps
the execution sequence of dependent operations.
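
The two requirements on vector scheduling functions (full column rank and lexicographic ordering of dependent operations) can be checked by brute force on a small instance. The sketch below assumes a hypothetical single-statement nest `a[i][j] = a[i-1][j] + a[i][j-1]` and the candidate schedule $t(i, j) = (i + j, i)$; both are illustrative choices, and the enumeration over a 4 x 4 domain stands in for a symbolic proof.

```python
from fractions import Fraction

def rank(rows):
    """Rank of an integer matrix by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][col] / m[rk][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk

def schedule(T, J):
    """Apply the linear part of the schedule: one coordinate per row of T."""
    return tuple(sum(t * j for t, j in zip(row, J)) for row in T)

T = [[1, 1], [1, 0]]          # assumed schedule t(i, j) = (i + j, i)
size = 4
pairs = []                    # (source, sink) pairs of dependent iterations
for i in range(size):
    for j in range(size):
        if i > 0:
            pairs.append(((i - 1, j), (i, j)))
        if j > 0:
            pairs.append(((i, j - 1), (i, j)))

rank_ok = rank(T) == 2        # rank condition for n_beta = 2
# Python compares tuples lexicographically, which matches the ordering used here.
order_ok = all(schedule(T, J) >= schedule(T, I) for I, J in pairs)
print(rank_ok, order_ok)
```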

We consider the functions $(t^{(\beta)}_1, \ldots, t^{(\beta)}_r)$, $r < n$, as
allocation functions that determine a spatial mapping of an algorithm onto an
$r$-dimensional space of virtual processors. That is, the values of the indices
of the $r$ outer loops of the transformed algorithm determine processor
coordinates. The values of the indices of the $n - r$ inner loops determine the
iterations to be executed on the processor.
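
A minimal sketch of this split, assuming the hypothetical schedule $t(i, j) = (i + j, i)$ with $r = 1$ on a 3 x 3 iteration domain: the first coordinate selects the virtual processor, the remaining coordinate orders the work on that processor. The helper name `split_schedule` is an illustrative assumption.

```python
def split_schedule(T, r, J):
    """Return (processor coordinates, local iteration) for iteration vector J."""
    t = [sum(a * b for a, b in zip(row, J)) for row in T]
    return tuple(t[:r]), tuple(t[r:])

T = [[1, 1], [1, 0]]   # assumed schedule t(i, j) = (i + j, i)
r = 1                  # one processor coordinate, one time coordinate

work = {}              # processor -> list of (local time, original iteration)
for i in range(3):
    for j in range(3):
        proc, time = split_schedule(T, r, (i, j))
        work.setdefault(proc, []).append((time, (i, j)))
for ops in work.values():
    ops.sort()         # execute the operations of one processor in time order

for proc in sorted(work):
    print(proc, work[proc])
```

Each anti-diagonal $i + j = \text{const}$ of the original domain ends up on its own virtual processor.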

Usually we need to take into account the number of processors used to execute
the program. Then, to simplify code generation, it is necessary that the
following conditions be valid:
$t^{(\beta)}_\xi(J) \ge t^{(\alpha)}_\xi(I)$, $J \in V_\beta$, $I \in V_\alpha$,
if $S_\alpha(I) \to S_\beta(J)$, $1 \le \xi \le r$.

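Let functions $d^{(l)}: W_l \to \mathbb{Z}^r$ assign a vector
$(d^{(l)}_1(F), \ldots, d^{(l)}_r(F))$ to each element $a_l(F)$ of an array
$a_l$. Suppose $d^{(l)}_\xi$ are affine functions:
$d^{(l)}_\xi(F) = \eta^{(l,\xi)} F + z^{(l,\xi)} N + y_{l,\xi}$, $F \in W_l$,
$\eta^{(l,\xi)} \in \mathbb{Z}^{\nu_l}$, $z^{(l,\xi)} \in \mathbb{Z}^{e}$,
$y_{l,\xi} \in \mathbb{Z}$, $1 \le l \le L$, $1 \le \xi \le r$. Let the element
$a_l(F)$ be stored in the local memory of the processor determined by the
coordinates $(d^{(l)}_1(F), \ldots, d^{(l)}_r(F))$.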

Let us introduce some notation:
$\tau^{(\xi)} = (\tau^{(1,\xi)}, \ldots, \tau^{(K,\xi)}, \eta^{(1,\xi)}, \ldots, \eta^{(L,\xi)}, b^{(1,\xi)}, \ldots, b^{(K,\xi)}, z^{(1,\xi)}, \ldots, z^{(L,\xi)}, a_{1,\xi}, \ldots, a_{K,\xi}, y_{1,\xi}, \ldots, y_{L,\xi})$
is a vector whose entries are the parameters of the functions
$t^{(\beta)}_\xi$ and $d^{(l)}_\xi$.

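The following proposition gives the condition to be used for finding scheduling
functions that satisfy condition (1).

Proposition 1. Suppose $\operatorname{rank} T^{(\beta)}_{\xi-1} = r$,
$r < n_\beta$, where $T^{(\beta)}_{\xi-1}$ is a matrix whose rows are the
vectors $\tau^{(\beta,i)}$, $1 \le i \le \xi - 1$. Suppose $s^{(\xi)}_\beta$ is
a fixed vector of the set
$S^{(\xi)}_\beta = \{ s \in \mathbb{Z}^{n_\beta} \mid \tau^{(\beta,i)} s = 0,\ 1 \le i \le \xi - 1,\ s \ne 0 \}$,
$2 \le \xi \le n$. Then

$\operatorname{rank} T^{(\beta)}_{\xi} = r + 1$ if $\tau^{(\beta,\xi)} s^{(\xi)}_\beta \ne 0$.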



Since all entries are integers, the condition
$\tau^{(\beta,\xi)} s^{(\xi)}_\beta \ne 0$ is equivalent to the inequality

$|\tau^{(\beta,\xi)} s^{(\xi)}_\beta| \ge 1.$   (3)
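
The effect of Proposition 1 can be seen on a small numeric instance. The sketch below assumes a hypothetical previous row $(1, 1, 0)$ and the vector $s = (1, -1, 0)$ from its null space; the row and the two candidate vectors are illustrative, and the `rank` helper is ordinary Gaussian elimination. The condition is sufficient: a candidate with a non-zero product with $s$ always raises the rank.

```python
from fractions import Fraction

def rank(rows):
    """Rank of an integer matrix by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][col] / m[rk][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

prev_rows = [[1, 1, 0]]       # rows found so far (assumed), rank 1
s = [1, -1, 0]                # non-zero vector with prev_rows * s = 0

candidate_a = [1, 0, 0]       # dot with s is 1, so |tau * s| >= 1 holds
candidate_b = [2, 2, 0]       # dot with s is 0, the condition fails

rank_a = rank(prev_rows + [candidate_a])
rank_b = rank(prev_rows + [candidate_b])
print(dot(candidate_a, s), rank_a)   # 1 2
print(dot(candidate_b, s), rank_b)   # 0 1
```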

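Let $v^{(\alpha,\beta,m)}$ be the vertices of the polyhedron $V_{\alpha,\beta}$,
and let $m(\alpha,\beta)$ be the number of the vertices. Any vertex can be
represented in the form
$v^{(\alpha,\beta,m)} = W^{(\alpha,\beta,m)} N + w^{(\alpha,\beta,m)}$. Let
$N^{(0)} \in \mathbb{Z}^{e}$ be a vector whose $i$-th entry is equal to the
smallest possible value of the outer variable $N_i$. Suppose the coordinates of
the vector $N$ can be arbitrarily large. Then we can show that
$t^{(\beta)}_\xi(J) - t^{(\alpha)}_\xi(I)$ is non-negative for all $I$ and $J$
such that $S_\alpha(I) \to S_\beta(J)$ iff

$(\tau^{(\beta,\xi)} - \tau^{(\alpha,\xi)}\Phi_{\alpha,\beta})(W^{(\alpha,\beta,m)} N^{(0)} + w^{(\alpha,\beta,m)}) + (b^{(\beta,\xi)} - b^{(\alpha,\xi)} - \tau^{(\alpha,\xi)}\Psi_{\alpha,\beta}) N^{(0)} + a_{\beta,\xi} - a_{\alpha,\xi} + \tau^{(\alpha,\xi)}\varphi^{(\alpha,\beta)} \ge 0, \quad 1 \le m \le m(\alpha,\beta);$

$(\tau^{(\beta,\xi)} - \tau^{(\alpha,\xi)}\Phi_{\alpha,\beta}) W^{(\alpha,\beta,m)} + b^{(\beta,\xi)} - b^{(\alpha,\xi)} - \tau^{(\alpha,\xi)}\Psi_{\alpha,\beta} \ge 0, \quad 1 \le m \le m(\alpha,\beta).$

In the vector-matrix form

$\tau^{(\xi)} D^{0}_{\alpha,\beta} \ge 0, \quad \tau^{(\xi)} D_{\alpha,\beta} \ge 0.$   (4)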
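
The idea behind the vertex conditions can be sketched as follows: an affine function of the iteration vector, evaluated at a vertex that itself depends affinely on $N$, becomes an affine function of $N$; it is non-negative for all $N \ge N^{(0)}$ exactly when its coefficient at $N$ is non-negative and its value at $N^{(0)}$ is non-negative. The code assumes a single outer variable and a hypothetical one-dimensional domain $1 \le i \le N$; the function `nonneg_on_domain` and its arguments are illustrative names, not the paper's notation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nonneg_on_domain(c_J, c_N, c_0, vertices, N0):
    """Check g(J, N) = c_J*J + c_N*N + c_0 >= 0 at every vertex for all N >= N0.

    vertices: list of (W, w) pairs, each vertex being W*N + w for a scalar N.
    """
    for W, w in vertices:
        slope = dot(c_J, W) + c_N               # coefficient at N along this vertex
        value_at_N0 = slope * N0 + dot(c_J, w) + c_0
        if slope < 0 or value_at_N0 < 0:
            return False
    return True

# Hypothetical domain 1 <= i <= N: the vertices are i = 1 and i = N.
vertices = [([0], [1]), ([1], [0])]

ok = nonneg_on_domain(c_J=[1], c_N=0, c_0=-1, vertices=vertices, N0=1)   # g = i - 1
bad = nonneg_on_domain(c_J=[1], c_N=0, c_0=-2, vertices=vertices, N0=1)  # g = i - 2
print(ok, bad)
```

Here `g` plays the role of the difference between the schedule values of the sink and the source of a dependence.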

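Let us introduce vector variables $z^{0}_{\alpha,\beta}$ and $z_{\alpha,\beta}$.
A solution of (4) is a solution of the equations

$\tau^{(\xi)} D^{0}_{\alpha,\beta} - z^{0}_{\alpha,\beta} = 0, \quad z^{0}_{\alpha,\beta} \ge 0, \quad \tau^{(\xi)} D_{\alpha,\beta} - z_{\alpha,\beta} = 0, \quad z_{\alpha,\beta} \ge 0.$   (5)

The following propositions can be easily proved:

1) If $z^{0}_{\alpha,\beta} = 0$, $z_{\alpha,\beta} = 0$ in (5) for all $\xi$,
$1 \le \xi \le n$, then $t^{(\beta)}(J) = t^{(\alpha)}(I)$ for all $I$ and $J$
such that $S_\alpha(I) \to S_\beta(J)$.

2) If $z^{0}_{\alpha,\beta} > 0$ in (5), then
$t^{(\beta)}_\xi(J) - t^{(\alpha)}_\xi(I) > 0$ for all $I$ and $J$ such that
$S_\alpha(I) \to S_\beta(J)$.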

Thus, to find a space-time mapping of an algorithm is to find vectors
$\tau^{(\xi)}$, $1 \le \xi \le n$, for which the following conditions are valid.
Suppose we are searching for the vectors $\tau^{(\xi)}$, $\xi = 1, 2, \ldots, n$,
sequentially. Then condition (3) has to be valid for $\beta$ such that
$n - \xi + 1 = n_\beta - \operatorname{rank} T^{(\beta)}_{\xi-1}$. Conditions
(5) have to be valid for all $(\alpha, \beta) \in P$ except in the following
case. Suppose $\xi \ge r + 1$ and for some $(\alpha, \beta) \in P$ the
inequality $z^{0}_{\alpha,\beta} > 0$ is valid; then validity of conditions (5)
is not necessary for these $(\alpha, \beta)$ in the sequel.

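Consider the alignment problem. The operation $S_\beta(J)$ is assigned to
execute at the virtual processor
$(t^{(\beta)}_1(J), \ldots, t^{(\beta)}_r(J))$. The array element
$a_l(\overline{F}_{l,\beta,q}(J))$ is stored in the local memory of the
processor
$(d^{(l)}_1(\overline{F}_{l,\beta,q}(J)), \ldots, d^{(l)}_r(\overline{F}_{l,\beta,q}(J)))$.
The expressions
$\delta^{(l,\beta,q)}_\xi(J) = t^{(\beta)}_\xi(J) - d^{(l)}_\xi(\overline{F}_{l,\beta,q}(J))$,
$1 \le \xi \le r$, determine the distance between the processors. Assuming
$\delta^{(l,\beta,q)}_\xi(J) = 0$ we obtain conditions for communication-free
allocation:
$\tau^{(\beta,\xi)} - \eta^{(l,\xi)} F_{l,\beta,q} = 0$,
$b^{(\beta,\xi)} - \eta^{(l,\xi)} G_{l,\beta,q} - z^{(l,\xi)} = 0$,
$a_{\beta,\xi} - \eta^{(l,\xi)} f^{(l,\beta,q)} - y_{l,\xi} = 0$.
In the vector-matrix form:
$\tau^{(\xi)} \Delta^{F}_{l,\beta,q} = 0$,
$\tau^{(\xi)} \Delta^{G}_{l,\beta,q} = 0$,
$\tau^{(\xi)} \Delta^{f}_{l,\beta,q} = 0$.

Let us introduce vector variables $z^{F}_{l,\beta,q}$, $z^{G}_{l,\beta,q}$,
$z^{f}_{l,\beta,q}$. Thus, to find an operation and data allocation such that
the number of communications is as small as possible is to minimize (or to put
to zero if it is possible) the coordinates of the vectors $z^{F}_{l,\beta,q}$,
$z^{G}_{l,\beta,q}$, $z^{f}_{l,\beta,q}$ for which the following equations are
valid:

$|\tau^{(\xi)} \Delta^{F}_{l,\beta,q}| - z^{F}_{l,\beta,q} = 0, \quad |\tau^{(\xi)} \Delta^{G}_{l,\beta,q}| - z^{G}_{l,\beta,q} = 0, \quad |\tau^{(\xi)} \Delta^{f}_{l,\beta,q}| - z^{f}_{l,\beta,q} = 0.$   (6)

Here $|v|$ is a vector whose entries are the absolute values of the entries of
a vector $v$.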
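
The communication-free conditions can be evaluated directly as three residuals: the coefficients at $J$, at $N$, and the constant term of the processor distance. The sketch below assumes one processor coordinate, one outer variable, an operation allocation $t_1(i, j) = i$ and a data allocation $d_1(F) = F_1$; the two accesses `a[i][j]` and `a[j][i]` are hypothetical examples, and `residuals` is an illustrative helper.

```python
def row_times_matrix(row, M):
    """Product of a row vector and a matrix given as a list of rows."""
    return [sum(row[k] * M[k][c] for k in range(len(row))) for c in range(len(M[0]))]

def residuals(tau, b, a, eta, z, y, F, G, f):
    """Coefficients of the processor distance t(J) - d(F*J + G*N + f)."""
    r_J = [t - x for t, x in zip(tau, row_times_matrix(eta, F))]
    r_N = [bb - x - zz for bb, x, zz in zip(b, row_times_matrix(eta, G), z)]
    r_0 = a - sum(e * fi for e, fi in zip(eta, f)) - y
    return r_J, r_N, r_0

tau, b, a = [1, 0], [0], 0        # operation (i, j) runs on processor i
eta, z, y = [1, 0], [0], 0        # element a[p][q] is stored on processor p
G, f = [[0], [0]], [0, 0]

aligned = residuals(tau, b, a, eta, z, y, [[1, 0], [0, 1]], G, f)     # access a[i][j]
transposed = residuals(tau, b, a, eta, z, y, [[0, 1], [1, 0]], G, f)  # access a[j][i]
print(aligned)      # ([0, 0], [0], 0): no communication needed
print(transposed)   # ([1, -1], [0], 0): distance i - j, communication needed
```

The absolute values of the non-zero residuals are what the slack variables in (6) measure.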

4 Conditions of Time and Space Localization 

To obtain time localization is to find functions $t^{(\beta)}$ such that the
values $t^{(\beta)}(J)$ and $t^{(\alpha)}(I)$ satisfying (2) are as
lexicographically close to each other as possible (the closer these values are,
the sooner an array element is reused). We reduced conditions (2) to
constraints (5); thus, to achieve our goal is to minimize (to zero at best) the
vectors $z^{0}_{\alpha,\beta}$ and $z_{\alpha,\beta}$.

Validity of condition (2), i.e., of conditions (5), is necessary for all
dependences except in-dependences. Write analogues of conditions (5) for
in-dependences:

$|\tau^{(\xi)} D^{0}_{\alpha,\beta}| - z^{0}_{\alpha,\beta} = 0, \quad |\tau^{(\xi)} D_{\alpha,\beta}| - z_{\alpha,\beta} = 0.$   (7)

Thus, the requirements of time localization can be reduced to minimizing
(zeroing if it is possible) the vectors $z^{0}_{\alpha,\beta}$ and
$z_{\alpha,\beta}$ when conditions (5), (7) are valid.

To obtain space localization is to use array elements that are stored close
to each other in memory at iterations that are close to each other. To be
definite, assume that the programming language C is used. In this case, array
elements are stored by rows. Thus, the elements of the $l$-th array that are
stored close to each other in memory are those that differ from each other only
in the last coordinate of the index expressions:
$\overline{F}_{l,\beta,q}(J) - \overline{F}_{l,\beta,q}(I) = \lambda e^{(\nu_l)}$,
$\lambda \in \mathbb{Z}$, where $e^{(\nu_l)}$ is the last unit vector of
$\mathbb{Z}^{\nu_l}$. We realize space localization among operations of the same
statement for a fixed access to an array.
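
Row-major storage can be made concrete by computing the linear offset of an element. The sketch below uses an assumed 4 x 5 array shape; `row_major_offset` is an illustrative helper that mirrors how C lays out a multi-dimensional array.

```python
def row_major_offset(index, shape):
    """Linear offset of a multi-dimensional index in row-major (C) order."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset

shape = (4, 5)   # assumed array shape: 4 rows, 5 columns

same_row = row_major_offset((2, 4), shape) - row_major_offset((2, 3), shape)
next_row = row_major_offset((3, 3), shape) - row_major_offset((2, 3), shape)
print(same_row, next_row)  # 1 5
```

Elements that differ only in the last index are adjacent in memory, while a step in the first index jumps by a whole row.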

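Introduce some notation:
$\tilde{F}_{l,\beta,q} \in \mathbb{Z}^{(\nu_l - 1) \times n_\beta}$ is a matrix
whose rows are the rows of the matrix $F_{l,\beta,q}$ except the last row;
$r(l,\beta,q)$ is the rank of the matrix $\tilde{F}_{l,\beta,q}$,
$r(l,\beta,q) \le n_\beta$; $d^{(\gamma)}_{l,\beta,q}$,
$1 \le \gamma \le n_\beta - r(l,\beta,q)$, is a fundamental system of solutions
of the homogeneous system of equations $\tilde{F}_{l,\beta,q} x = 0$.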

Theorem 1. Let $t^{(\beta)}$ be a multi-dimensional scheduling. Choose functions
$t^{(\beta)}_\xi$, $\xi \in \{\xi_1, \ldots, \xi_{r(l,\beta,q)}\}$, among the
functions $t^{(\beta)}_1, \ldots, t^{(\beta)}_d$, $d \le n$. Suppose these
functions satisfy the conditions

$\tau^{(\beta,\xi)} d^{(\gamma)}_{l,\beta,q} = 0, \quad 1 \le \gamma \le n_\beta - r(l,\beta,q),$   (8)

$\operatorname{rank} T_{l,\beta,q} = r(l,\beta,q).$

Here $T_{l,\beta,q}$ is a matrix whose rows are the vectors
$\tau^{(\beta,\xi_1)}, \ldots, \tau^{(\beta,\xi_{r(l,\beta,q)})}$. Then elements
of only one row of the $l$-th array are used in the $q$-th access of the
operation $S_\beta$ for fixed values of the indices of the $d$ outer loops.
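
The statement of Theorem 1 can be observed by enumeration on a small case. The sketch assumes a hypothetical access `a[j][i]` in a nest over $(i, j)$: the access matrix without its last row is $(0, 1)$, its null space is spanned by $(1, 0)$, so an outer loop index $\tau = (0, 1)$ satisfies the orthogonality condition while $\tau = (1, 0)$ does not. The helper `rows_touched` is illustrative.

```python
def rows_touched(F, tau, size):
    """Map each value of the outer loop index tau*J to the set of array rows accessed."""
    touched = {}
    for i in range(size):
        for j in range(size):
            J = (i, j)
            outer = sum(t * x for t, x in zip(tau, J))
            # All index coordinates except the last one identify the array row.
            row = tuple(sum(c * x for c, x in zip(F_row, J)) for F_row in F[:-1])
            touched.setdefault(outer, set()).add(row)
    return touched

F = [[0, 1], [1, 0]]                          # assumed access a[j][i]

interchanged = rows_touched(F, (0, 1), 4)     # outer loop index j
original = rows_touched(F, (1, 0), 4)         # outer loop index i

print(max(len(rows) for rows in interchanged.values()))  # 1
print(max(len(rows) for rows in original.values()))      # 4
```

With the outer index $j$, each outer iteration stays within a single row of the array; with the outer index $i$, every outer iteration walks across all rows.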



That is, to obtain space localization is to get $r(l,\beta,q)$ linearly
independent vectors $\tau^{(\beta,\xi)}$ that satisfy condition (8); the values
$\xi$ are intended to be as small as possible. Condition (8) can be written in
the vector-matrix form $\tau^{(\xi)} D_{l,\beta,q} = 0$.

Thus the conditions of space localization can be reduced to minimizing (zeroing
if it is possible) the vectors $z_{l,\beta,q}$ when the following conditions
are valid:

$|\tau^{(\xi)} D_{l,\beta,q}| - z_{l,\beta,q} = 0.$   (9)



5 Procedure of Affine Transformation of Loop Nests

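Introduce some notation: $\mathcal{D}^{0,(1)}$ and $\mathcal{D}^{(1)}$ are sets
of matrices $D^{0}_{\alpha,\beta}$ and $D_{\alpha,\beta}$ accordingly that
describe flow-, out- and anti-dependences; $\mathcal{D}^{0}_{in}$ and
$\mathcal{D}_{in}$ are sets of matrices $D^{0}_{\alpha,\beta}$ and
$D_{\alpha,\beta}$ accordingly that describe in-dependences; $\mathcal{D}_F$,
$\mathcal{D}_G$, $\mathcal{D}_f$ are sets of matrices
$\Delta^{F}_{l,\beta,q}$, $\Delta^{G}_{l,\beta,q}$, and vectors
$\Delta^{f}_{l,\beta,q}$ accordingly; $\mathcal{D}^{(1)}_{loc}$ is a set of
matrices $D_{l,\beta,q}$; $S^{(1)}_\beta = \mathbb{Z}^{n_\beta}$;
$T^{(\beta)}_0 = 0$; $T^{(\beta)}_{l,q,\xi}$ is a matrix whose rows are the
vectors $\tau^{(\beta,d)}$, $1 \le d \le \xi$, satisfying condition (8);
$L^{(\xi)} = \{ \beta \mid n - \xi + 1 = n_\beta - \operatorname{rank} T^{(\beta)}_{\xi-1} \}$;

$\rho(z^{0}_{\alpha,\beta}, z_{\alpha,\beta}, z^{F}_{l,\beta,q}, z^{G}_{l,\beta,q}, z^{f}_{l,\beta,q}, z_{l,\beta,q}) = \sum_{\alpha,\beta} (\lambda^{0}_{\alpha,\beta} z^{0}_{\alpha,\beta} + \lambda_{\alpha,\beta} z_{\alpha,\beta}) + \sum_{l,\beta,q} (\lambda^{F}_{l,\beta,q} z^{F}_{l,\beta,q} + \lambda^{G}_{l,\beta,q} z^{G}_{l,\beta,q} + \lambda^{f}_{l,\beta,q} z^{f}_{l,\beta,q}) + \sum_{l,\beta,q} \lambda_{l,\beta,q} z_{l,\beta,q}.$

The sum $\sum_{\alpha,\beta}$ is over all $\alpha, \beta$ such that
$D^{0}_{\alpha,\beta} \in \mathcal{D}^{0,(\xi)} \cup \mathcal{D}^{0}_{in}$,
$D_{\alpha,\beta} \in \mathcal{D}^{(\xi)} \cup \mathcal{D}_{in}$, and the sums
$\sum_{l,\beta,q}$ are over all $l, \beta, q$ such that
$\Delta^{F}_{l,\beta,q} \in \mathcal{D}_F$,
$\Delta^{G}_{l,\beta,q} \in \mathcal{D}_G$,
$\Delta^{f}_{l,\beta,q} \in \mathcal{D}_f$, and
$D_{l,\beta,q} \in \mathcal{D}^{(\xi)}_{loc}$, respectively;
$\lambda^{0}_{\alpha,\beta}$, $\lambda_{\alpha,\beta}$,
$\lambda^{F}_{l,\beta,q}$, $\lambda^{G}_{l,\beta,q}$, $\lambda^{f}_{l,\beta,q}$,
$\lambda_{l,\beta,q}$ are weights. The sets $\mathcal{D}^{0,(\xi)}$,
$\mathcal{D}^{(\xi)}$ and $\mathcal{D}^{0}_{in}$, $\mathcal{D}_{in}$ consist of
matrices $D^{0}_{\alpha,\beta}$, $D_{\alpha,\beta}$, and the sets
$\mathcal{D}^{(\xi)}_{loc}$ consist of matrices $D_{l,\beta,q}$ from conditions
(5), (7), and (9) that the vector $\tau^{(\xi)}$, $\xi > 1$, is to satisfy.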

Coordinates of the weights correspond to columns of the matrices
$D^{0}_{\alpha,\beta}$, $D_{\alpha,\beta}$, $\Delta^{F}_{l,\beta,q}$,
$\Delta^{G}_{l,\beta,q}$, $D_{l,\beta,q}$ and to the vectors
$\Delta^{f}_{l,\beta,q}$. Suppose a column of a matrix or a vector occurs
$\gamma$ times; then the larger $\gamma$, the greater the role of this column
or vector in minimizing the number of communications between processors and in
improving locality. Thus, the value of the corresponding coordinate is to be
larger. The weights can also express a preference in the choice of operation
and data allocation. Suppose it is desirable that there is no exchange of
elements of some array $a_{l_0}$; then the weights $\lambda^{F}_{l_0,\beta,q}$,
$\lambda^{G}_{l_0,\beta,q}$, $\lambda^{f}_{l_0,\beta,q}$ are to be larger than
the others.

To find the vectors $\tau^{(\xi)}$ it is necessary to minimize the values of the
variables $z^{0}_{\alpha,\beta}$, $z_{\alpha,\beta}$, $z^{F}_{l,\beta,q}$,
$z^{G}_{l,\beta,q}$, $z^{f}_{l,\beta,q}$, $z_{l,\beta,q}$. Thus, to find these
vectors is to solve the following optimization problem. Choose a vector
$s^{(\xi)}_\beta \in S^{(\xi)}_\beta$, $\beta \in L^{(\xi)}$, and minimize the
value of the function $\rho$ subject to the following conditions: condition (3)
for $\beta \in L^{(\xi)}$, conditions (6) if $\xi \le r$, and conditions (5),
(7), (9).



The following procedure summarizes the previous investigations. The aim of
the procedure is to find a multi-dimensional scheduling and data allocation
satisfying the condition of communication-free allocation and the conditions of
space and time localization. The procedure is recursive and consists of $n$
recursions. The $\xi$-th recursion results in getting a vector $\tau^{(\xi)}$.

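Procedure (finding scheduling and allocation functions): Put $\xi = 1$.

Step 1. Choose a vector $s^{(\xi)}_\beta \in S^{(\xi)}_\beta$,
$\beta \in L^{(\xi)}$. Find a vector $\tau^{(\xi)}$ by solving the
optimization problem

$\min \{ \rho(z^{0}_{\alpha,\beta}, z_{\alpha,\beta}, z^{F}_{l,\beta,q}, z^{G}_{l,\beta,q}, z^{f}_{l,\beta,q}, z_{l,\beta,q}) \mid$
condition (3), $\beta \in L^{(\xi)}$;
condition (5), $D^{0}_{\alpha,\beta} \in \mathcal{D}^{0,(\xi)}$, $D_{\alpha,\beta} \in \mathcal{D}^{(\xi)}$;
condition (7), $D^{0}_{\alpha,\beta} \in \mathcal{D}^{0}_{in}$, $D_{\alpha,\beta} \in \mathcal{D}_{in}$;
condition (6), $\Delta^{G}_{l,\beta,q} \in \mathcal{D}_G$, $\Delta^{f}_{l,\beta,q} \in \mathcal{D}_f$, $\Delta^{F}_{l,\beta,q} \in \mathcal{D}_F$, $\xi \le r$;
condition (9), $D_{l,\beta,q} \in \mathcal{D}^{(\xi)}_{loc}$ $\}$.

Step 2. If $\xi < r + 1$ then define sets:
$\mathcal{D}^{0,(\xi+1)} = \mathcal{D}^{0,(\xi)}$,
$\mathcal{D}^{(\xi+1)} = \mathcal{D}^{(\xi)}$.
If $\xi \ge r + 1$ then define sets:
$\mathcal{D}^{0,(\xi+1)} = \mathcal{D}^{0,(\xi)} \setminus \{ D^{0}_{\alpha,\beta} \mid z^{0}_{\alpha,\beta} > 0 \}$,
$\mathcal{D}^{(\xi+1)} = \mathcal{D}^{(\xi)} \setminus \{ D_{\alpha,\beta} \mid z^{0}_{\alpha,\beta} > 0 \}$.

Step 3. Define sets:
$\mathcal{D}^{(\xi+1)}_{loc} = \{ D_{l,\beta,q} \mid \operatorname{rank} T^{(\beta)}_{l,q,\xi} < r(l,\beta,q) \}$.

Step 4. Define a set
$L^{(\xi+1)} = \{ \beta \mid n - \xi = n_\beta - \operatorname{rank} T^{(\beta)}_{\xi} \}$.

Step 5. If $\xi = n$ then exit the procedure, else increase $\xi$ by 1 and go
to Step 1.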
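
The recursive structure of the procedure can be sketched on a deliberately reduced toy problem. The code below is not the optimization problem of Step 1: it assumes a single statement with uniform dependences given by distance vectors, ignores data allocation and space localization, and replaces the optimization by brute-force enumeration of coefficient vectors with entries in $\{-1, 0, 1\}$. The function name `toy_procedure` and all inputs are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def rank(rows):
    """Rank of an integer matrix by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][col] / m[rk][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk

def toy_procedure(distances, n, r, coeffs=(-1, 0, 1)):
    """Pick one schedule row per recursion, minimizing the total dependence distance."""
    chosen = []
    active = list(distances)              # dependences that still must be respected
    for xi in range(1, n + 1):
        best = None
        for tau in product(coeffs, repeat=n):
            if rank(chosen + [list(tau)]) != len(chosen) + 1:
                continue                  # the rank has to grow at every recursion
            slack = [sum(t * d for t, d in zip(tau, dist)) for dist in active]
            if any(s < 0 for s in slack):
                continue                  # a dependence would be violated
            cost = sum(slack)
            if best is None or cost < best[0]:
                best = (cost, list(tau), slack)
        if best is None:
            raise ValueError("no admissible vector in the searched coefficient range")
        chosen.append(best[1])
        if xi >= r + 1:                   # time coordinates: drop strictly satisfied ones
            active = [d for d, s in zip(active, best[2]) if s == 0]
    return chosen

# Hypothetical nest with dependence distances (1, 0) and (0, 1), one processor coordinate.
rows = toy_procedure([(1, 0), (0, 1)], n=2, r=1)
print(rows)  # [[0, 1], [1, 0]]
```

For this input the sketch returns a loop interchange: the first row gives the processor coordinate $j$, the second row gives the time coordinate $i$.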



6 Data Exchange Sequence 

Suppose for some fixed parameters $l, \beta, q, \xi$ the conditions of
communication-free allocation are not valid (i.e., at least one of the
variables $z^{F}_{l,\beta,q}$, $z^{G}_{l,\beta,q}$, $z^{f}_{l,\beta,q}$ is not
equal to zero at some recursion of the procedure). Then it is necessary to pass
elements of array $a_l$ in order to use them for the $q$-th input of elements
of array $a_l$ into instruction $S_\beta$.

By $P(z_1, \ldots, z_r)$ denote a processor allocated at the point
$(z_1, \ldots, z_r)$ of the virtual processor space. According to the functions
$t^{(\beta)}$ and $d^{(l)}$, the array elements
$a_l(\overline{F}_{l,\beta,q}(J))$ are stored in the local memory of the
processors
$P(d^{(l)}_1(\overline{F}_{l,\beta,q}(J)), \ldots, d^{(l)}_r(\overline{F}_{l,\beta,q}(J)))$
and they are used in the processors
$P(t^{(\beta)}_1(J), \ldots, t^{(\beta)}_r(J))$ at the iterations
$(t^{(\beta)}_{r+1}(J), \ldots, t^{(\beta)}_n(J))$. In the general case,
point-to-point communications can be organized between pairs of these
processors.

To reduce the program execution time, it is desirable to identify fast
communications such as broadcast, gather, scatter, reduction, and data
translation. Consider broadcast as an example.

Let $F$ be an element of the set $W_l$. Denote by
$V^{(F)}_{l,\beta,q} = \{ J \in V_\beta \mid \overline{F}_{l,\beta,q}(J) = F \}$
the set of those iterations of the initial loop nest at which the array element
$a_l(F)$ is used for the $q$-th input of elements of array $a_l$ into
instruction $S_\beta$.



The set $V^{(F)}_{l,\beta,q}$ is called non-degenerate if
$\dim(\ker F_{l,\beta,q}) \ne 0$ and there exists a vector
$J_0 \in V^{(F)}_{l,\beta,q}$ such that $J_0 + u_i \in V_\beta$, where $u_i$ is
any base vector of the intersection of $\ker F_{l,\beta,q}$ and
$\mathbb{Z}^{n_\beta}$.

Let $u^{(1)}_{l,\beta,q}, \ldots, u^{(\zeta(l,\beta,q))}_{l,\beta,q}$ be a
fundamental system of solutions of the homogeneous system of equations
$F_{l,\beta,q} x = 0$.

Theorem 2. Suppose the set $V^{(F)}_{l,\beta,q}$ is non-degenerate; the function
$\overline{F}_{l,\beta,q}$ occurs in the right part of the instruction
$S_\beta$; conditions

$\tau^{(\beta,\xi)} u^{(\zeta)}_{l,\beta,q} = 0, \quad r + 1 \le \xi \le n, \quad 1 \le \zeta \le \zeta(l,\beta,q),$

and one of the following conditions are valid:

a) the elements of array $a_l$ occur only in the right parts of the instructions,
b) constraints

are valid for the flow-dependence produced by the $q$-th input of elements of
array $a_l$ into instruction $S_\beta$.

Then to pass the data $a_l(F)$ it is possible to arrange a broadcast from the
processor
$P(d^{(l)}_1(\overline{F}_{l,\beta,q}(J)), \ldots, d^{(l)}_r(\overline{F}_{l,\beta,q}(J)))$
to the processors $P(t^{(\beta)}_1(J), \ldots, t^{(\beta)}_r(J))$ at the
iteration $(t^{(\beta)}_{r+1}(J), \ldots, t^{(\beta)}_n(J))$,
$J \in V^{(F)}_{l,\beta,q}$.
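
The orthogonality condition of Theorem 2 can be illustrated on a matrix-vector product. The sketch assumes the hypothetical statement `y[i] += A[i][j] * x[j]` with iteration vector $(i, j)$, the read-only access `x[j]`, and a schedule in which $i$ is the processor coordinate and $j$ is the time coordinate; all of these are illustrative choices, and the code checks only the time-orthogonality condition, corresponding to case a).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

F_x = [[0, 1]]             # assumed access x[j]: the index depends on j only
kernel_basis = [[1, 0]]    # fundamental solutions of F_x * u = 0
T = [[1, 0], [0, 1]]       # t1 = i (processor), t2 = j (time)
r = 1

time_rows = T[r:]
# Time coordinates must not change along the kernel of the access matrix.
condition_holds = all(dot(tau, u) == 0 for tau in time_rows for u in kernel_basis)

size = 4
j0 = 2
users = [(i, j0) for i in range(size)]   # iterations that read the element x[j0]
times = {tuple(dot(tau, J) for tau in time_rows) for J in users}
processors = {tuple(dot(tau, J) for tau in T[:r]) for J in users}

print(condition_holds, times, sorted(processors))
```

All uses of `x[2]` fall on the same time step and on different processors, which is the situation where one broadcast can replace several point-to-point transfers.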

7 Conclusion 

In this paper, we propose a method of mapping algorithms for parallel execution
onto distributed memory parallel computers. The method determines the
allocation of operations and data over processors, an execution sequence
of operations, and the data exchange necessary for program execution. The aim
is to minimize the number of communications, to improve the locality of an
algorithm, and to determine the possibility of broadcasts.
Note some advantages of the method suggested: 

- an initial algorithm is represented by affine loop nests of an arbitrary nesting
structure;

- the suggested conditions can be simply obtained from a source algorithm; 

- the conditions do not depend on the definite values of outer variables; the 
obtained functions depend on outer variables parametrically; 

- the method can be automated. 

The method was applied to mapping algorithms for matrix transformations
onto distributed memory parallel computers. These algorithms were implemented
on the supercomputer SKIF (located at the NAS of Belarus, Minsk).